ABSTRACT

The main disadvantages of independent checkpointing are the possible domino effect and
the associated storage space overhead for maintaining multiple checkpoints. In most previous
work, it has been assumed that only the checkpoints older than the current global recovery
line can be discarded. In this paper, we generalize the notion of recovery line to potential
recovery line. Only the checkpoints belonging to at least one of the potential recovery lines
cannot be discarded. By using the model of maximum-sized antichains on a partially ordered
set, an efficient algorithm is developed for finding all non-discardable checkpoints, and we
show that the number of non-discardable checkpoints cannot exceed N(N + 1)/2, where N
is the number of processors. Communication-trace-driven simulation for several hypercube
programs is performed to show the benefit of the proposed algorithm for real applications.


Key words: fault tolerance, message-passing systems, independent checkpointing, recovery 
lines, checkpoint space reclamation 


I INTRODUCTION 


Numerous checkpointing and rollback recovery techniques have been proposed in the lit- 
erature for parallel systems. They can be classified into two categories. Coordinated check- 
pointing schemes synchronize computation with checkpointing by coordinating processors 
during a checkpointing session in order to maintain a consistent set of checkpoints [1, 2, 3]. 
Each processor only keeps the most recent checkpoint and rollback propagation is avoided 
at the cost of potentially significant performance degradation during normal execution. In- 
dependent checkpointing schemes replace the above synchronization by dependency tracking 
and possibly message logging [4, 5, 6, 7] in order to preserve process autonomy. Possible roll- 
back propagation in case of a fault is handled by reconstruction of a consistent system state 
based on the dependency information. Lower run-time overhead during normal execution is 
achieved by maintaining multiple checkpoints and allowing slower recovery. 

This paper considers the independent checkpointing schemes. Most research on this
subject has concentrated on algorithms for finding the latest consistent set of checkpoints,
i.e., the recovery line, during rollback recovery. The same algorithms can be applied to
the set of existing checkpoints during normal execution to find the current global recovery
line. All the checkpoints older than the current global recovery line then become obsolete
checkpoints and therefore can be discarded. When the domino effect [3, 8] occurs, a large
number of non-obsolete checkpoints have to be kept on stable storage, resulting in a large
space overhead.

Our approach is based on the observation that many non-obsolete checkpoints can also be 
discarded because they will never become members of any future recovery line. The notion of 
recovery line is generalized to potential recovery line. A checkpoint is non-discardable if and 
only if it belongs to at least one of the potential recovery lines. By modeling a recovery line 
as the maximum maximum-sized antichain on a partially ordered set, an efficient algorithm
is presented for finding the union of all potential recovery lines, which gives the set of
non-discardable checkpoints. A maximum on the size of this set is also derived to show that even
when the domino effect persists during program execution, the space overhead for maintaining
multiple checkpoints will not grow without limit.

The outline of the paper is as follows. Section II describes the system model; background
materials are introduced in Section III; Section IV formulates the problem; Section V gives
the necessary and sufficient conditions for a checkpoint to be non-discardable; the
checkpoint space reclamation algorithm is developed in Section VI; the maximum number of
non-discardable checkpoints is derived in Section VII; Section VIII extends the results to
recovery protocols using virtual checkpoints and Section IX is the conclusion.


II SYSTEM MODEL 


A Checkpointing and Rollback Recovery 

The system model considered in this paper is a message-passing system consisting of a 
number of concurrent processes for which all process communication is through message 
passing. Processes are assumed to run on fail-stop processors [9] and each processor is 
considered an individual recovery unit. 

During normal execution, the state of each processor is occasionally saved as a checkpoint
on stable storage and can be reloaded for rollback recovery in case of a detected error. Let
$CP_{i,k}$ denote the kth checkpoint of processor $p_i$ with $k \ge 0$ and $0 \le i \le N - 1$, where
N is the number of processors. A checkpoint interval is defined to be the time between
two consecutive checkpoints on the same processor. Each processor takes its checkpoint
independently, i.e. without synchronizing with any other processors, and includes in each
checkpoint the communication information containing:

1. its own processor number and checkpoint number and 

2. the sender’s processor number and checkpoint number tagged on each message it has 
received during the previous checkpoint interval. 

A centralized checkpoint space reclamation algorithm can be invoked by any processor occa- 
sionally to collect the global communication information, construct the dependency graph, 
determine the set of obsolete checkpoints and reclaim the storage space. 

We consider two different rollback recovery procedures, Schemes A and B. Scheme A
basically follows the algorithm described by Bhargava and Lian [6] and is summarized as
follows. When a processor $p_i$ detects an error, it starts a two-phase centralized recovery
procedure. First, a rollback-initiating message is sent to every other processor to request the
up-to-date communication information. Each surviving processor takes a virtual checkpoint
upon receiving the rollback-initiating message so that the communication information during
the most recent checkpoint interval is also collected. After receiving the responses, $p_i$
constructs the complete dependency graph and executes the rollback propagation algorithm
(described in the next section) to determine the recovery line. A rollback-request message
is then sent to each processor. The message requests each involved processor to reload the
checkpoint in the recovery line and restart.

In this paper, Scheme B is proposed as a variation of Scheme A. Instead of a virtual 
checkpoint, a real checkpoint is taken by each surviving processor upon receiving the rollback- 
initiating message. The recovery line then consists of all real checkpoints. This modified 
scheme takes advantage of the coordination needed for recovery and can often force the 
current recovery line to move forward. Each processor can then discard all checkpoints 
except the one belonging to this recovery line. Rollback propagation for possible recovery in
the future is therefore bounded by this new recovery line.

B Consistency of Checkpoints 

There are two situations concerning the consistency between two checkpoints. In Fig. 1(a), if
$p_i$ and $p_j$ restart from the checkpoints $CP_{i,k}$ and $CP_{j,m}$ respectively, the message m is recorded
as "received but not yet sent". In a general model without the assumption of deterministic
execution, message m is not guaranteed to be re-sent during reexecution. $CP_{i,k}$ and $CP_{j,m}$
are thus inconsistent.




Figure 1: (a) Inconsistent checkpoints; (b) consistent checkpoints. 

Fig. 1(b) illustrates the second situation. The message m is recorded as "sent but not yet
received" according to the system state containing $CP_{i,k}$ and $CP_{j,m}$. By defining the state of
the channels to be the set of messages sent but not yet received, it has been proved [2, 10]
that checkpoints like $CP_{i,k}$ and $CP_{j,m}$ can be considered consistent if the corresponding state
of the channels is also recorded. In Koo and Toueg's paper [3], such state was assumed to
be recorded at the sender side in the form of lost messages and the set of messages was
guaranteed to be delivered reliably by some end-to-end transmission protocol. When it is
impossible or difficult to implement the above scheme, pessimistic message logging [11, 12, 13]
can ensure the state of the channels is properly recorded at the receiving end. As a result,
we consider the situation in Fig. 1(b) as consistent.


III PRELIMINARIES


A Partially Ordered Set and Checkpoint Graph 

In a message-passing system, an event a happens before event b [14] if and only if 

1. a and b are events in the same processor, and a occurs before b; or

2. a is the sending of a message by one processor and b is the receiving of the same 
message by another processor; or 

3. a happens before c and c happens before b. 

The set of events with the "happens before" relation forms a partially ordered set, or poset
[14]. When dealing with the problem of finding a consistent set of checkpoints, we only
consider the induced subposet [15] $P = (C, <)$, where C is the set of all checkpoints and < is
the "happens before" relation.

A checkpoint graph (CPG), of which the transitive closure is the poset P, is a directed
acyclic graph constructed as follows [4]. Each vertex on the checkpoint graph represents a
checkpoint. A directed edge exists from vertex $CP_{j,m}$ to vertex $CP_{i,k}$ if $j = i$ and $k = m + 1$,
or $j \ne i$ and there exists a message sent by processor $p_j$ between $CP_{j,m}$ and $CP_{j,m+1}$ and
received by processor $p_i$ between $CP_{i,k-1}$ and $CP_{i,k}$. Fig. 2 gives an example of a CPG with
its corresponding communication pattern.
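
The edge rule above can be sketched in Python. This is only an illustrative sketch: the function name `build_cpg`, the `(processor, index)` vertex encoding and the `(j, m, i, k)` message-tag tuples are assumptions introduced here, not notation from the paper.

```python
from collections import defaultdict


def build_cpg(num_checkpoints, messages):
    """Build a checkpoint graph as an adjacency map (hypothetical helper).

    num_checkpoints[i] is the number of checkpoints taken by processor p_i,
    so its vertices are (i, 0), ..., (i, num_checkpoints[i] - 1).
    A message tag (j, m, i, k) means: sent by p_j between CP_{j,m} and
    CP_{j,m+1}, received by p_i between CP_{i,k-1} and CP_{i,k}.
    """
    edges = defaultdict(set)
    for i, count in enumerate(num_checkpoints):
        for k in range(count):
            edges[(i, k)]  # make sure every vertex has an entry
            if k + 1 < count:
                edges[(i, k)].add((i, k + 1))  # case j = i and k = m + 1
    for j, m, i, k in messages:
        # Assumption of this sketch: tags that refer to a checkpoint
        # that has not been taken yet are simply skipped.
        if j != i and m < num_checkpoints[j] and k < num_checkpoints[i]:
            edges[(j, m)].add((i, k))  # case j != i
    return {v: sorted(ws) for v, ws in edges.items()}


# Two processors with two checkpoints each and one message from p_0 to p_1.
cpg = build_cpg([2, 2], [(0, 0, 1, 1)])
print(cpg[(0, 0)])  # -> [(0, 1), (1, 1)]
```

The adjacency map stores only direct edges; the poset P corresponds to its transitive closure.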

Most of the ideas in this paper will be illustrated by the better visualized CPG instead
of the more abstract poset. An element a in a poset is maximal (minimal) if there does
not exist any element b such that a < b (b < a); correspondingly, a vertex in a CPG will
be referred to as maximal (minimal) if it has no outgoing (incoming) edge. Also, the following
terminologies will be used interchangeably: a < b, a is "smaller than" b, b is "greater than"
a, a can "strictly reach" b and b is "strictly reachable from" a.

Figure 2: (a) The checkpoint and communication pattern; (b) the corresponding checkpoint
graph


B Maximum-Sized Antichain and Recovery Line 

A partial ordering of a set S is linear if for every two elements a and b in S, either $a \le b$ or
$b \le a$ [15]. In a poset, a subset whose elements are linearly ordered is called a chain and a
set of elements, no two of which are comparable, is called an antichain. In particular, a set
of any number of maximal (minimal) elements clearly forms an antichain. The antichains
with the largest number of elements are called the maximum-sized antichains or M-chains
for short. Let $\mathcal{A}(Q)$ denote the set of antichains on a poset Q and, for $A, B \in \mathcal{A}(Q)$, define
$A \preceq B$ if and only if for all $a \in A$ there exists $b \in B$ such that $a \le b$ [16]. Also let $\mathcal{M}(Q)$
denote the set of maximum-sized antichains. We then have the following properties.

LEMMA 1 (1) $(\mathcal{A}(Q), \preceq)$ forms a poset;

(2) $(\mathcal{A}(Q), \preceq)$ is a lattice and its subposet $(\mathcal{M}(Q), \preceq)$ is a sublattice;

(3) For $M_1, M_2 \in \mathcal{M}(Q)$, the join (least upper bound) $M_1 \vee M_2 = \max(M_1 \cup M_2)$ and the
meet (greatest lower bound) $M_1 \wedge M_2 = \min(M_1 \cup M_2)$, where max(S) denotes the set of
maximal elements in S and min(S) is similarly defined [17, 16].
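
Part (3) of Lemma 1 can be illustrated with a short Python sketch. The helper names `maximal`, `minimal` and `less`, and the two-processor poset used here (two processors that never exchange messages), are hypothetical choices made only for this illustration.

```python
def maximal(s, less):
    """Elements of s with no strictly greater element in s."""
    return {a for a in s if not any(less(a, b) for b in s)}


def minimal(s, less):
    """Elements of s with no strictly smaller element in s."""
    return {a for a in s if not any(less(b, a) for b in s)}


def less(a, b):
    # Hypothetical poset: vertices are (processor, index) pairs and only
    # checkpoints of the same processor are comparable.
    return a[0] == b[0] and a[1] < b[1]


# Two maximum-sized antichains (one checkpoint per processor).
m1 = {(0, 0), (1, 1)}
m2 = {(0, 1), (1, 0)}

join = maximal(m1 | m2, less)  # least upper bound of m1 and m2
meet = minimal(m1 | m2, less)  # greatest lower bound of m1 and m2
print(sorted(join), sorted(meet))
```

In this example both the join and the meet again contain exactly one checkpoint per processor, as expected for members of the sublattice of M-chains.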


Since $(\mathcal{M}(Q), \preceq)$ is a finite lattice, there must exist a unique maximum member $M^*(Q)$,
called the maximum maximum-sized antichain or MM-chain, such that $M \preceq M^*(Q)$ for
every $M \in \mathcal{M}(Q)$.

LEMMA 2 For any $M \in \mathcal{M}(Q)$, there must not exist any $a \in M^*(Q)$ such that $a < b$ for
$b \in M$.

Proof. Suppose there exist such $a \in M^*(Q)$ and $b \in M$. $M \preceq M^*(Q)$ implies there exists
$c \in M^*(Q)$ such that $b \le c$. Together with $a < b$, this leads to $a < c$, contradicting the fact
that $M^*(Q)$ is an antichain. □

In this paper, we define a global checkpoint to be a set of checkpoints, one from each
processor. Based on the discussion on consistency in the previous section, a consistent global
checkpoint is a set of checkpoints, one from each processor and no two of which are comparable
through the "happens before" relation. A recovery line refers to the latest consistent global
checkpoint. Because one special feature of the poset $P = (C, <)$ is that there always
exists a natural chain decomposition $C_0, C_1, \ldots, C_{N-1}$ where $C_i$ is the set of all checkpoints of
processor $p_i$, the size $d(P)$ of the M-chains cannot be greater than N. Furthermore, because
the first checkpoint of every processor must be minimal and the set of such checkpoints
always forms an antichain of size N, $d(P)$ is equal to N and each M-chain will consist of N
elements, one from each $C_i$. It becomes clear that each M-chain is equivalent to a consistent
global checkpoint. Since it is always desirable to roll back to the most recent consistent global
checkpoint in order to minimize the recovery cost, Lemma 1 guarantees the existence and
uniqueness of such a recovery line, i.e., the MM-chain.


C Ideal, Filter and The Reachable Set

Given a poset P, if $\mathcal{I}$ is a set of elements of P with the property

$$a \in \mathcal{I} \text{ and } b \le a \;\Rightarrow\; b \in \mathcal{I},$$

$\mathcal{I}$ is called an ideal or a down-set of P. Similarly, a filter or an up-set, $\mathcal{F}$, of P is a set of
elements such that if $a \in \mathcal{F}$ and $a \le b$, then $b \in \mathcal{F}$.

For an antichain A in P, define

$$\mathcal{I}(A) = \{x \in P : x \le a \text{ for some } a \in A\}$$

$$\mathcal{F}(A) = \{x \in P : x \ge a \text{ for some } a \in A\}.$$

Then $\mathcal{I}(A)$ is an ideal [16] and $\mathcal{F}(A)$ is a filter.

LEMMA 3 If A and B are antichains, then [16]

(1) $\mathcal{I}(A) \subseteq \mathcal{I}(B) \iff A \preceq B$;

(2) $\mathcal{F}(A) \subseteq \mathcal{F}(B) \iff B \preceq A$.

In terms of the CPG, the set of vertices which can reach any vertex in an antichain A is
equal to $\mathcal{I}(A)$ and the set of vertices reachable from any vertex in A is equal to $\mathcal{F}(A)$.


D The Rollback Propagation Algorithm 

The algorithm for finding the recovery line will form the basis of our checkpoint space
reclamation algorithm. The problem of finding the MM-chain in a general poset can be
transformed into that of finding a maximum matching on a bipartite graph [18]. For the
poset $P = (C, <)$ in our problem, a simpler rollback propagation algorithm, shown in Fig. 3,
has been proposed [4] and applied to the CPG.


/* CP stands for checkpoint */ 

/* Initially, all the CPs are unmarked */ 

include the latest CP of each processor in the root set; 
mark all CPs strictly reachable from any CP in the root set; 
while (at least one CP in the root set is marked) { 

replace each marked CP in the root set by the latest unmarked CP on the 
same processor; 

mark all CPs strictly reachable from any CP in the root set; 

} 

the root set is the recovery line. 


Figure 3: The Rollback Propagation Algorithm 

The complexity of the algorithm is linear in the number of edges because each edge can 
be removed after it is used to reach some vertex and therefore visited at most once. 
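
The procedure of Fig. 3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of [4]: vertices are encoded as `(processor, index)` pairs, `edges` is an adjacency map, the names `mark_from` and `recovery_line` are hypothetical, and the first checkpoint of every processor is assumed to have no incoming edge. Unlike Fig. 3, the sketch replaces marked root checkpoints one at a time and marks immediately, which reaches the same fixed point.

```python
def mark_from(edges, start, marked):
    """Mark every vertex strictly reachable from `start`."""
    stack = list(edges.get(start, ()))
    while stack:
        v = stack.pop()
        if v not in marked:
            marked.add(v)
            stack.extend(edges.get(v, ()))


def recovery_line(edges, latest):
    """Return the recovery line as a set of (processor, index) vertices.

    latest[i] is the index of the latest checkpoint of processor i.
    """
    root = dict(latest)
    marked = set()
    for i, k in root.items():
        mark_from(edges, (i, k), marked)
    changed = True
    while changed:
        changed = False
        for i in list(root):
            k = root[i]
            if (i, k) in marked:
                # Roll back to the latest unmarked checkpoint of processor i.
                while (i, k) in marked:
                    k -= 1
                root[i] = k
                mark_from(edges, (i, k), marked)
                changed = True
    return {(i, k) for i, k in root.items()}


# Hypothetical two-processor "domino" pattern: every checkpoint after the
# initial ones is preceded by a message receipt from the other processor.
domino = {
    (0, 0): [(0, 1), (1, 1)],
    (0, 1): [(0, 2), (1, 2)],
    (0, 2): [],
    (1, 0): [(1, 1)],
    (1, 1): [(1, 2), (0, 1)],
    (1, 2): [(0, 2)],
}
print(sorted(recovery_line(domino, {0: 2, 1: 2})))  # -> [(0, 0), (1, 0)]
```

In this example the rollback propagates all the way to the initial checkpoints, which is the domino effect described in the introduction.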


IV PROBLEM FORMULATION 

We first define the potential recovery line of a given checkpoint graph G as the recovery 
line of any checkpoint graph G' which can possibly evolve from G during program execution 
in the future. Since the purpose of keeping checkpoints is for possible future recovery, a 
checkpoint is discardable if and only if it does not belong to any potential recovery line. 
Being older than the current global recovery line is simply a sufficient condition for being a 
discardable checkpoint but not a necessary condition. We will show there exist checkpoints
not older than the global recovery line yet discardable.

Schemes A and B for rollback recovery, as described in Section II, present different levels
of difficulty for the problem of identifying discardable checkpoints. In Scheme B, a recovery
line exists after each recovery such that all the older checkpoints can be discarded and
all the newer checkpoints are invalidated by the rollback (Fig. 4(a) and (c)). Therefore,
each recovery will start a new checkpoint graph and we only have to consider the potential
recovery lines of the existing graph up to the next recovery. Scheme A can be viewed as a more
general case of Scheme B. Some checkpoints will be invalidated due to the rollback and result
in a checkpoint graph which is a subgraph of the one before recovery (Fig. 4(a) and (b)).
A checkpoint is discardable only when it will never belong to any recovery line no matter
how many times the recovery occurs. We first consider Scheme B in the following sections
and then show in Section VIII that the same necessary and sufficient conditions are also
applicable to Scheme A.

Although the execution time of a normal program is finite, the number of ways in which
the existing CPG can be augmented with new vertices is enormous because the communication
pattern is in general unpredictable. By recognizing the following rules for adding new vertices to a
checkpoint graph, we are able to reduce the almost infinite number of situations to a finite
number of cases for the problem of identifying the minimum set of non-discardable checkpoints.
For each new vertex $CP_{i,k}$,

Rule 1: $CP_{i,k}$ must have an incoming edge from $CP_{i,k-1}$, except for the first vertex on each
chain, which has no incoming edge;

Rule 2: $CP_{i,k}$ can have incoming edges from arbitrary existing vertices, but it cannot have
any outgoing edge to any existing vertex.

Note that a checkpoint $CP_{i,k}$ that happens before $CP_{j,m}$ may not be collected before
$CP_{j,m}$. However, such a situation can be detected from the communication information. If
a vertex $CP_{j,m}$ is expecting an incoming edge from a non-existing vertex $CP_{i,k}$, $CP_{j,m}$ and
its associated incoming edges will be excluded from the existing CPG. By adding each new
vertex under this constraint, none of the new vertices can have edges pointing to any existing
vertex and, therefore, Rule 2 is enforced. The following important property is ensured by
Rule 2.

Figure 4: Checkpoint graphs (a) at the time of recovery; (b) after recovery with Scheme A;
(c) after recovery with Scheme B. (Dashed lines represent the communication information
from virtual checkpoints. Black vertices indicate the current global recovery line before
recovery and gray vertices form the local recovery line at the time of recovery.)

PROPERTY 1 Adding a new vertex v and its associated incoming edges to an existing
CPG cannot change the relation between any pair of existing vertices.

Proof. The relation between any pair of existing vertices will be changed only if one
vertex is smaller than v and the other one is greater than v. However, Rule 2 guarantees
none of the existing vertices is greater than v. Therefore, the property holds. □

Let $\mathcal{G}_f(G)$ denote the set of all future graphs obtainable by adding new vertices to a given
checkpoint graph G according to the above rules. Lemma 4 gives the relationship between
the antichains of G and those of its future graphs.

LEMMA 4 Given $G = (V, E)$ and $G' \in \mathcal{G}_f(G)$,

(1) $\mathcal{A}(G) \subseteq \mathcal{A}(G')$;

(2) $A \in \mathcal{A}(G')$ and $A \subseteq V \Rightarrow A \in \mathcal{A}(G)$;

(3) $\mathcal{M}(G) \subseteq \mathcal{M}(G')$;

(4) $M \in \mathcal{M}(G')$ and $M \subseteq V \Rightarrow M \in \mathcal{M}(G)$;

(5) $M^*(G) \preceq M^*(G')$.

Proof. (1) and (2) follow immediately from Property 1. By Rule 1 and the discussion
after Lemma 2, the size of the maximum-sized antichains is always fixed and equal to the
number of processors. So (3) and (4) hold. In particular, $M^*(G) \in \mathcal{M}(G) \subseteq \mathcal{M}(G')$ implies
$M^*(G) \preceq M^*(G')$. □

Let $N_D(G)$ denote the set of non-discardable checkpoints of a given graph G. By
definition, we have

$$N_D(G) = \{v : v \in G \text{ and } v \in M^*(G') \text{ for some } G' \in \mathcal{G}_f(G)\}. \qquad (1)$$

Our goal is to develop an algorithm to efficiently find the set $N_D(G)$.


V THE SET OF NON-DISCARDABLE CHECKPOINTS


One feature of Scheme B is that the checkpoint graph is always growing until a rollback
recovery occurs, after which a new checkpoint graph starts. In this case, it is clear by the
definition in Eq. 1 that $N_D(G') \cap G \subseteq N_D(G)$ for any $G' \in \mathcal{G}_f(G)$ because $\mathcal{G}_f(G') \subseteq \mathcal{G}_f(G)$.
In other words, once a checkpoint is determined to be discardable, it will never become
non-discardable for any future graph. Note that a discardable checkpoint v can be removed
from the graph by the following procedure without affecting any recovery line in the future.

1. A new edge is generated for each pair of incoming and outgoing edges of v in order to 
preserve the relations implied through v among the remaining vertices. 

2. The source vertices of all the incoming edges of v have to be remembered. When an 
outgoing edge of v is added in the future, it is replaced by the outgoing edges from 
these source vertices. 

However, since we are mainly concerned about checkpoint space reclamation, we will leave
all the vertices corresponding to discardable checkpoints in the graph for simplicity.

One special future graph of G, $\hat{G}$, will play a very important role throughout this paper
and is constructed as follows:

1. adjoin N new vertices $n_0, n_1, \ldots, n_{N-1}$ to G;

2. an edge is added from the last vertex $e_i$ on each chain $C_i$ to $n_i$ as shown in Fig. 5.

Let $V_n$ denote the set of all such $n_i$'s and $V_e$ denote the set of $e_i$'s. We now prove the following
necessary and sufficient conditions for a checkpoint to be non-discardable.

Figure 5: Construction of the future graph $\hat{G}$ by adding $n_i$'s to the checkpoint graph G.

THEOREM 1 Given a checkpoint graph $G = (V, E)$ and $v \in G$,

$v \in M^*(G')$ for some $G' = (V', E') \in \mathcal{G}_f(G)$

if and only if $v \in M^*(\hat{G} - W)$ for some $W \subseteq V_n$.

Proof. The if part is trivial because $\hat{G} - W \in \mathcal{G}_f(G)$. We now prove the only if part.
If $v \in M^*(G')$ for some $G' \in \mathcal{G}_f(G)$, let $M^*(G') = M_1 \cup M_2$ such that $M_1 = M^*(G') \cap V$
and $M_2 = M^*(G') \setminus M_1$ as shown in Fig. 6(a). In particular, $v \in M_1$. Define $p(u) = i$ if u
represents a checkpoint of processor $p_i$ and decompose the set $V_n$ as $V_n = B_1 \cup B_2$ where

$$B_1 = \{n_{p(u)} : u \in M_1\}$$

$$B_2 = \{n_{p(u)} : u \in M_2\}.$$

We want to show that $M_1 \cup B_2 = M^*(\hat{G} - B_1)$ (Fig. 6(b)).

First we prove $M_1 \cup B_2 \in \mathcal{M}(\hat{G} - B_1)$. Consider the graph $G'$. For every $u \in M_2$,
$e_{p(u)} < u$ by Rule 1. According to Lemma 3, $\mathcal{I}(e_{p(u)}) \subseteq \mathcal{I}(u)$. Since u and all the vertices in
$M_1$ belong to the same antichain, $M_1 \cap \mathcal{I}(u) = \emptyset$. It follows that $M_1 \cap \mathcal{I}(e_{p(u)}) = \emptyset$. Now
consider the graph $\hat{G} - B_1$. The above equation still holds because of Property 1. By the
construction of $\hat{G}$, $\mathcal{I}(n_{p(u)}) = \mathcal{I}(e_{p(u)}) \cup \{n_{p(u)}\}$ and therefore $M_1 \cap \mathcal{I}(n_{p(u)}) = \emptyset$. Since Rule 2
implies $M_1 \cap \mathcal{F}(n_{p(u)}) = \emptyset$, we have proved that every vertex $n_{p(u)}$ in $B_2$ is incomparable with
every vertex in $M_1$. Because $M_1$ is an antichain by itself and all $n_{p(u)}$'s in $B_2$ are maximal,
$M_1 \cup B_2 \in \mathcal{M}(\hat{G} - B_1)$.

Next we prove $M_1 \cup B_2 = M^*(\hat{G} - B_1)$ by contradiction. Because every vertex in $B_2$ is
maximal on the chain it belongs to, $B_2 \subseteq M^*(\hat{G} - B_1)$ by Lemma 2. Suppose $M_1 \cup B_2 \ne
M^*(\hat{G} - B_1)$. There must exist $M_1' = M^*(\hat{G} - B_1) \setminus B_2$ such that $M_1 \preceq M_1'$ and $M_1 \ne M_1'$.
We then have $\mathcal{F}(M_1') \subseteq \mathcal{F}(M_1)$ by Lemma 3. Recall $M_1$ and $M_2$ form an antichain in the
graph $G'$, which implies $M_2 \cap \mathcal{F}(M_1) = \emptyset$. Thus $M_2 \cap \mathcal{F}(M_1') = \emptyset$ and, because Rule 2
guarantees $M_2 \cap \mathcal{I}(M_1') = \emptyset$, $M_1' \cup M_2 \in \mathcal{M}(G')$. The fact that $M_1' \cup M_2$ is a greater M-chain
than $M_1 \cup M_2$ in $G'$ contradicts $M^*(G') = M_1 \cup M_2$. Hence, $M_1 \cup B_2 = M^*(\hat{G} - B_1)$.

It immediately follows that if $v \in M^*(G')$ for some $G' \in \mathcal{G}_f(G)$, $v \in M_1 \subseteq M^*(\hat{G} - B_1)$
for $B_1 \subseteq V_n$. □

Figure 6: Suppose (a) $M_1 \cup M_2$ forms the MM-chain of $G'$, then (b) $M_1 \cup B_2$ forms the
MM-chain of $\hat{G} - B_1$.

The contribution of Theorem 1 is that it classifies the enormous number of possibilities
for any existing vertex to become a member of some MM-chain in the future into an exponential
number of cases. If we apply the rollback propagation algorithm to each of the $2^N$ graphs
$\hat{G} - W$, $W \subseteq V_n$, and take the union of all the resulting MM-chains, we obtain the set of
non-discardable checkpoints. However, this is an exponential algorithm and may be unacceptable
for applications with a large number of processors. We will give another theorem in the next
section which can further reduce the number of cases.


VI THE MAXIMUM CHECKPOINT SPACE 
RECLAMATION ALGORITHM 

By applying Lemma 1, we will show that each of the $2^N$ MM-chains in Theorem 1 can
be "synthesized" by N MM-chains. An efficient algorithm is then developed for finding the
set of non-discardable checkpoints.

LEMMA 5 Given a poset $P = (S, \le)$ and $A, B \subseteq S$,

$$\min(A \cup B) = \min(\min(A) \cup B).$$

Proof. Let $\min'(A) = A \setminus \min(A)$. By definition, for each $a' \in \min'(A)$, there exists
$a \in A$ such that $a \le a'$ and $a \ne a'$. Since $a, a' \in A \cup B$, $a' \notin \min(A \cup B)$. Therefore,

$$\min(A \cup B) = \min(\min(A) \cup \min'(A) \cup B) = \min(\min(A) \cup B).$$ □

LEMMA 6 Given a poset P, $M \in \mathcal{M}(P)$ and $M \preceq M_i \in \mathcal{M}(P)$ for $i \in [0, k-1]$ for any
finite k. Define

$$\bigwedge_{i \in [0,k-1]} M_i = (\ldots((M_0 \wedge M_1) \wedge M_2) \ldots) \wedge M_{k-1},$$

then

(1) $$M \preceq \bigwedge_{i \in [0,k-1]} M_i \in \mathcal{M}(P);$$

(2) $$\bigwedge_{i \in [0,k-1]} M_i = \min\Big(\bigcup_{i \in [0,k-1]} M_i\Big).$$

Proof. Both parts will be proved by induction on k.

(1) By Lemma 1, $\mathcal{M}(P)$ is a lattice and so $M_0 \wedge M_1 \in \mathcal{M}(P)$. Also, $M \preceq M_0 \wedge M_1$ because
$M \preceq M_0$, $M \preceq M_1$ and $M_0 \wedge M_1$ is the greatest lower bound of $M_0$ and $M_1$. We have shown
the case k = 2 is true. Assume it is true for k = n - 1, i.e.

$$M \preceq \bigwedge_{i \in [0,n-2]} M_i \in \mathcal{M}(P). \qquad (2)$$

Again, by Lemma 1,

$$\bigwedge_{i \in [0,n-1]} M_i = \Big(\bigwedge_{i \in [0,n-2]} M_i\Big) \wedge M_{n-1} \in \mathcal{M}(P).$$

Eq. 2 and $M \preceq M_{n-1}$ imply

$$M \preceq \bigwedge_{i \in [0,n-1]} M_i.$$

Therefore, it is also true for k = n and so we have (1).

(2) The case k = 2 is true by Lemma 1. Assume it is true for k = n - 1, i.e.

$$\bigwedge_{i \in [0,n-2]} M_i = \min\Big(\bigcup_{i \in [0,n-2]} M_i\Big). \qquad (3)$$

Applying part (1), Eq. 3 and Lemma 1, we have

$$\bigwedge_{i \in [0,n-1]} M_i = \Big(\bigwedge_{i \in [0,n-2]} M_i\Big) \wedge M_{n-1} = \min\Big(\min\Big(\bigcup_{i \in [0,n-2]} M_i\Big) \cup M_{n-1}\Big).$$

Lemma 5 further gives that

$$\min\Big(\min\Big(\bigcup_{i \in [0,n-2]} M_i\Big) \cup M_{n-1}\Big) = \min\Big(\bigcup_{i \in [0,n-2]} M_i \cup M_{n-1}\Big) = \min\Big(\bigcup_{i \in [0,n-1]} M_i\Big).$$

Therefore, by induction, part (2) is true. □


THEOREM 2 For every $W \subseteq V_n$,

$$M^*(\hat{G} - W) = \min\Big(\bigcup_{n_i \in W} M^*(\hat{G} - n_i)\Big).$$

Proof. If there are k vertices in the set W, without loss of generality, we may assume
they are $z_0, z_1, \ldots, z_{k-1}$, i.e.

$$\{n_i : n_i \in W\} = \{z_j : j \in [0, k-1]\}.$$

Since $\hat{G} - z_j \in \mathcal{G}_f(\hat{G} - W)$, $M^*(\hat{G} - W) \preceq M^*(\hat{G} - z_j)$ for all $j \in [0, k-1]$ by Lemma 4.

Now consider the graph $\hat{G}$. $\hat{G} \in \mathcal{G}_f(\hat{G} - z_j)$ implies that $M^*(\hat{G} - z_j) \in \mathcal{M}(\hat{G})$ for
$j \in [0, k-1]$. Similarly, $M^*(\hat{G} - W) \in \mathcal{M}(\hat{G})$. By Lemma 6,

$$M^*(\hat{G} - W) \preceq \bigwedge_{j \in [0,k-1]} M^*(\hat{G} - z_j) = \min\Big(\bigcup_{j \in [0,k-1]} M^*(\hat{G} - z_j)\Big) \in \mathcal{M}(\hat{G}). \qquad (4)$$

Moreover, for every $j \in [0, k-1]$, there exists $u \in M^*(\hat{G} - z_j)$ with $p(u) = p(z_j)$ such that
$u < z_j$. Since $u \in \bigcup_{j \in [0,k-1]} M^*(\hat{G} - z_j)$, $z_j \notin \min(\bigcup_{j \in [0,k-1]} M^*(\hat{G} - z_j))$. We have, by
Lemma 4, $\min(\bigcup_{j \in [0,k-1]} M^*(\hat{G} - z_j)) \in \mathcal{M}(\hat{G} - W)$ and

$$\min\Big(\bigcup_{j \in [0,k-1]} M^*(\hat{G} - z_j)\Big) \preceq M^*(\hat{G} - W).$$

Together with Eq. 4, we have proved

$$M^*(\hat{G} - W) = \min\Big(\bigcup_{j \in [0,k-1]} M^*(\hat{G} - z_j)\Big) = \min\Big(\bigcup_{n_i \in W} M^*(\hat{G} - n_i)\Big).$$ □

In particular, the current global recovery line, $M^*(G)$, can be obtained by letting $W = V_n$:

$$M^*(G) = \min\Big(\bigcup_{i \in [0,N-1]} M^*(\hat{G} - n_i)\Big).$$


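COROLLARY 1 Given a checkpoint graph $G = (V, E)$,

$$N_D(G) = \bigcup_{i \in [0,N-1]} M^*(\hat{G} - n_i) \cap V.$$

Proof. For any $v \in \bigcup_{i \in [0,N-1]} M^*(\hat{G} - n_i) \cap V$, $v \in M^*(\hat{G} - n_i)$ for some $i \in [0, N-1]$.
Since $\hat{G} - n_i \in \mathcal{G}_f(G)$, $v \in N_D(G)$ by definition. Thus $\bigcup_{i \in [0,N-1]} M^*(\hat{G} - n_i) \cap V \subseteq N_D(G)$.

Conversely, for any $v \in N_D(G)$, $v \in V$ and $v \in M^*(\hat{G} - W)$ for some $W \subseteq V_n$ by
Theorem 1. Theorem 2 further gives that

$$v \in \min\Big(\bigcup_{n_i \in W} M^*(\hat{G} - n_i)\Big) \subseteq \bigcup_{n_i \in W} M^*(\hat{G} - n_i) \subseteq \bigcup_{i \in [0,N-1]} M^*(\hat{G} - n_i).$$

Therefore, $N_D(G) \subseteq \bigcup_{i \in [0,N-1]} M^*(\hat{G} - n_i) \cap V$ and so we have

$$N_D(G) = \bigcup_{i \in [0,N-1]} M^*(\hat{G} - n_i) \cap V.$$ □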

The above corollary provides the basis of a good algorithm for finding $N_D(G)$. We now
present the Maximum Checkpoint Space Reclamation (MCSR) algorithm in Fig. 7. The
algorithm is of complexity $O(N|E|)$, where $|E|$ is the total number of edges in the checkpoint
graph.

/* N is the number of processors */

/* Ĝ and n_i are as defined in the beginning of Section V */

for each i ∈ [0, N − 1] {

apply the rollback propagation algorithm on the checkpoint graph Ĝ − n_i to
find the recovery line;

add to the set N_D(G) the checkpoints in G and on the recovery line;

}

all the checkpoints not in N_D(G) can be reclaimed.

Figure 7: The Maximum Checkpoint Space Reclamation Algorithm

Fig. 8 shows the checkpoint graph corresponding to the initial part of execution of an
N-Queen program written in Chare Kernel language [19] which has been developed as a
medium-grain, message-driven and machine-independent parallel language at the University
of Illinois. Some edges that do not affect the result of applying the MCSR algorithm are
removed in order to get a clear picture. Denote the set of non-discardable vertices contributed
by $M^*(\hat{G} - n_i)$ as $N_{D_i}$; we have

$$N_{D_0} = \{a\}, \quad N_{D_1} = \{b\}, \quad N_{D_2} = \{c\},$$

$$N_{D_3} = N_{D_4} = N_{D_5} = \{d, e, f, g, h, i\},$$

shown as the shaded vertices in Fig. 8. Traditional checkpoint space reclamation algorithms
cannot reclaim any of the checkpoints due to the domino effect. It is interesting that the
MCSR algorithm determines that all non-shaded checkpoints can be reclaimed.
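
The MCSR procedure can be sketched in Python as follows. This is a hedged sketch, not the authors' implementation: the `(processor, index)` vertex encoding, the function names and the two-processor example graph (a hypothetical domino pattern, not the graph of Fig. 8) are assumptions. The vertex $n_j$ is encoded as `(j, latest[j] + 1)`. Instead of physically deleting $n_i$, the sketch starts the root of processor i at $e_i$; since $n_i$ has no outgoing edge, leaving it in the graph does not change which other vertices get marked.

```python
def mark_from(edges, start, marked):
    """Mark every vertex strictly reachable from `start`."""
    stack = list(edges.get(start, ()))
    while stack:
        v = stack.pop()
        if v not in marked:
            marked.add(v)
            stack.extend(edges.get(v, ()))


def recovery_line(edges, start_index):
    """Rollback propagation: the root of processor i starts at start_index[i]."""
    root = dict(start_index)
    marked = set()
    for i, k in root.items():
        mark_from(edges, (i, k), marked)
    changed = True
    while changed:
        changed = False
        for i in list(root):
            k = root[i]
            if (i, k) in marked:
                while (i, k) in marked:  # latest unmarked checkpoint
                    k -= 1
                root[i] = k
                mark_from(edges, (i, k), marked)
                changed = True
    return {(i, k) for i, k in root.items()}


def mcsr(edges, latest):
    """Return the non-discardable checkpoints of G (sketch of Fig. 7)."""
    # Build G-hat: add n_j = (j, latest[j] + 1) with an edge e_j -> n_j.
    ghat = {v: list(ws) for v, ws in edges.items()}
    for j, k in latest.items():
        ghat.setdefault((j, k), []).append((j, k + 1))
    keep = set()
    for i in latest:
        # Recovery line of G-hat minus n_i: processor i starts at e_i,
        # every other processor j starts at n_j.
        start = {j: (k if j == i else k + 1) for j, k in latest.items()}
        line = recovery_line(ghat, start)
        keep |= {(j, k) for j, k in line if k <= latest[j]}  # drop the n_j's
    return keep


# Hypothetical domino pattern on two processors with three checkpoints each.
domino = {
    (0, 0): [(0, 1), (1, 1)],
    (0, 1): [(0, 2), (1, 2)],
    (0, 2): [],
    (1, 0): [(1, 1)],
    (1, 1): [(1, 2), (0, 1)],
    (1, 2): [(0, 2)],
}
print(sorted(mcsr(domino, {0: 2, 1: 2})))  # -> [(0, 0), (0, 2), (1, 0)]
```

In this example the current global recovery line is {(0, 0), (1, 0)}, so no checkpoint is older than it, yet the sketch reports that (0, 1), (1, 1) and (1, 2) can be reclaimed.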


VII THE MAXIMUM NUMBER OF 
NON-DISCARDABLE CHECKPOINTS 


Figure 8: Checkpoint graph corresponding to part of the execution of an N-Queen program

Traditionally, the checkpoint space reclamation procedure is only performed for the set
of checkpoints older than the current global recovery line. Since it is possible for the domino
effect to persist during the program execution, a common perception is that all checkpoints
may have to be kept and the space overhead may be constantly growing as a program
proceeds. In a sense, this is a more serious disadvantage than the slower recovery due to the
domino effect because it results in unpredictable space overhead during normal execution.
Corollary 1 not only identifies the minimum set of non-discardable checkpoints but also places
an upper bound $N^2$ on the number of non-discardable checkpoints for a general checkpoint
graph because each $M^*(\hat{G} - n_i)$, $i \in [0, N-1]$, consists of N checkpoints. A smaller upper
bound obviously exists based on the following observations:

1. $M^*(\hat{G} - n_i)$ may contain vertices from the set $V_n$, but we are only concerned about
vertices in the existing graph G;

2. $M^*(\hat{G} - n_i)$'s may not be mutually disjoint;

3. if the last vertex, $e_i$, on chain $C_i$ is maximal, $M^*(\hat{G} - n_i)$ will contribute only a single
vertex to the set $N_D(G)$, i.e. $e_i$ itself.

The following lemma addresses the implicit relations among $M^*(\hat{G} - n_i)$'s.

LEMMA 7 Let $m_{ij}$ denote the vertex in $M^*(\hat{G} - n_i)$ from processor $p_j$. For $i, j \in [0, N-1]$
and $i \ne j$, if $m_{ij} \ne n_j$ and $m_{ji} \ne n_i$, then $M^*(\hat{G} - n_i) = M^*(\hat{G} - n_j)$.

Proof. $m_{ij} \ne n_j$ implies $M^*(\hat{G} - n_i) \subseteq \hat{G} - n_i - n_j$. By Lemma 4, $M^*(\hat{G} - n_i) \in
\mathcal{M}(\hat{G} - n_i - n_j) \subseteq \mathcal{M}(\hat{G} - n_j)$ and so

$$M^*(\hat{G} - n_i) \preceq M^*(\hat{G} - n_j).$$

Similarly, $m_{ji} \ne n_i$ leads to

$$M^*(\hat{G} - n_j) \preceq M^*(\hat{G} - n_i).$$

Since $\mathcal{M}(\hat{G} - n_i - n_j)$ forms a poset (Lemma 1), we have

$$M^*(\hat{G} - n_i) = M^*(\hat{G} - n_j).$$ □

THEOREM 3 For any checkpoint graph $G = (V, E)$,

$$|N_D(G)| \le \frac{N(N+1)}{2}.$$

Proof. By Corollary 1, we only have to consider the $N^2$ vertices $m_{ij}$, $i, j \in [0, N-1]$.
For each $i \in [0, N-1]$, $m_{ii} \in V$ and contributes one vertex to $N_D(G)$. Since all the $m_{ii}$'s
come from different processors, $N_D(G)$ now consists of $N$ vertices. For the remaining $N^2 - N$
vertices with $i \neq j$, we consider each pair $m_{ij}$ and $m_{ji}$ at a time and there are $(N^2 - N)/2$
such pairs. We distinguish three cases:

Case 1: $m_{ij} = n_j$ and $m_{ji} = n_i$. Neither $m_{ij}$ nor $m_{ji}$ belongs to $N_D(G)$.

Case 2: $m_{ij} = n_j$ and $m_{ji} \neq n_i$, or $m_{ij} \neq n_j$ and $m_{ji} = n_i$. This pair will possibly add one
new vertex to $N_D(G)$.

Case 3: $m_{ij} \neq n_j$ and $m_{ji} \neq n_i$. It follows that $M^*(G - n_i) = M^*(G - n_j)$ by Lemma 7,
and so $m_{ij} = m_{jj}$ and $m_{ji} = m_{ii}$. Since $m_{jj}$ and $m_{ii}$ are already in $N_D(G)$, this case
does not increase the size of $N_D(G)$.

Therefore, each of the $(N^2 - N)/2$ pairs can contribute at most one new vertex to $N_D(G)$.
We then have

$$|N_D(G)| \le N + \frac{N^2 - N}{2} = \frac{N(N+1)}{2}. \qquad \Box$$

One may argue that the upper bound derived in Theorem 3 is still of the order $N^2$.
We will next show that $N(N+1)/2$ is in fact the lowest upper bound, i.e., the maximum,
because for any $N$ we can construct a checkpoint graph, $G^*_N$, as shown in Fig. 9 to achieve
this upper bound. By applying the MCSR algorithm in Fig. 7, it is not hard to see that all
the $N(N+1)/2$ vertices in Fig. 9 are non-discardable.



Figure 9: $G^*_N$: The checkpoint graph with $N(N+1)/2$ non-discardable checkpoints

When a checkpoint graph is given, we can further reduce the maximum by counting the
number of maximal vertices, $L$, in the set $V_e$ (as shown in Fig. 5). Recall that if $e_i$ is maximal,
$m_{ii} = e_i$ and $m_{ij} = n_j$ for $j \neq i$. Therefore, in the discussion for each pair of $m_{ij}$ and $m_{ji}$ in
the proof of Theorem 3, the case when both $e_i$ and $e_j$ are maximal corresponds to Case 1.
Since there are $\binom{L}{2}$ such pairs, the maximum then becomes

$$|N_D(G)| \le N + \binom{N}{2} - \binom{L}{2}.$$

In particular, when $L = N$, $|N_D(G)| = N$, which corresponds to the case of coordinated
checkpointing.
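As a quick arithmetic check of the bounds above, the following minimal Python sketch evaluates the general maximum of Theorem 3 and the refined maximum for a given number of maximal last vertices. The function and parameter names are illustrative assumptions and do not come from the paper.

```python
def max_non_discardable(n, maximal_last=0):
    """Upper bound on |N_D(G)| for n processors.

    `maximal_last` plays the role of L, the number of maximal vertices
    in V_e (hypothetical parameter name). With maximal_last = 0 or 1 the
    result is the general bound n(n + 1)/2 of Theorem 3.
    """
    if not 0 <= maximal_last <= n:
        raise ValueError("maximal_last must lie in [0, n]")
    pairs = n * (n - 1) // 2                                # all (m_ij, m_ji) pairs
    case1_pairs = maximal_last * (maximal_last - 1) // 2    # pairs forced into Case 1
    return n + pairs - case1_pairs


for big_l in (0, 2, 4):
    print(big_l, max_non_discardable(4, big_l))
```

For four processors this prints 10, 9 and 4: the bound shrinks from $N(N+1)/2$ to $N$ as more last checkpoints become maximal.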

We would like to point out that the MCSR algorithm can be further improved by applying
Lemma 7. Inside the loop in Fig. 7, suppose we have found the recovery line $M^*(G - n_i)$.
Define the index set $T$ as

$$T = \{ j : m_{ij} \neq n_j,\ j \in [0, N-1] \text{ and } j > i \}.$$

Then for each later loop index $j \in T$, the rollback propagation algorithm can be aborted
when any checkpoint from processor $p_i$ is marked, because this would mean $m_{ji} \neq n_i$ and
$M^*(G - n_j)$ is exactly the same as $M^*(G - n_i)$.
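The early-abort test suggested by Lemma 7 can be sketched as follows. This is only an illustration under stated assumptions: a recovery line is represented as a Python list indexed by processor, the marker `VIRTUAL` stands for the virtual checkpoint $n_j$, and `marked_processors` is a hypothetical set of processors that already have a marked checkpoint in the current rollback propagation run. None of these names appear in the paper.

```python
VIRTUAL = None  # assumed marker: entry j of a line is the virtual checkpoint n_j


def index_set(line_i, i):
    """T = {j : m_ij != n_j, j > i} for the recovery line found at loop index i."""
    return {j for j, m in enumerate(line_i) if j > i and m is not VIRTUAL}


def may_abort(line_i, i, j, marked_processors):
    """True if the run for loop index j can stop early and reuse line_i.

    The run for index j may stop as soon as some checkpoint of processor i
    has been marked, provided j belongs to the index set built from line_i.
    """
    return j in index_set(line_i, i) and i in marked_processors


# Hypothetical recovery line found for i = 0 in a four-processor system.
line_0 = ["c00", "c11", VIRTUAL, "c31"]
print(sorted(index_set(line_0, 0)))  # [1, 3]
```

In this example the runs for loop indices 1 and 3 are candidates for early termination, while the run for index 2 must always complete.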


VIII RECOVERY USING VIRTUAL
CHECKPOINTS


In this section, we want to show that our results in previous sections can also be applied
to Scheme A, i.e., recovery using virtual checkpoints. When a processor $p_i$ detects an error
and decides to roll back, all the computation after its latest checkpoint is assumed invalid.
However, for each of the non-faulty processors, the state residing in the volatile storage is still
valid and can serve as a new checkpoint to advance the recovery line. Since the checkpoints
belonging to the current global recovery line will still remain on the stable storage, the virtual
checkpoints can be discarded after the recovery and disappear from the checkpoint graph.
The effect of performing such recovery on the checkpoint graph is described as follows. The
existing checkpoint graph $G = (V, E)$ is augmented by the set of virtual checkpoints at the
time of recovery and becomes graph $G'$. $M^*(G')$ is then obtained by the rollback propagation
algorithm. Processes roll back according to $M^*(G')$ and the vertices belonging to the set
$\mathcal{F}(M^*(G')) \setminus M^*(G')$ are deleted from $G'$, together with all the associated edges. Finally, all
the vertices representing the virtual checkpoints are deleted.
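The recovery steps just described can be illustrated with a small Python sketch. It is a simplified stand-in, not the paper's rollback propagation algorithm: the graph is assumed to be a DAG given as a successor dictionary (chain edges included), an edge from u to v is assumed to mean that u and v cannot both lie on a recovery line, and the names `reachable_from` and `recovery_line` are hypothetical.

```python
from collections import deque


def reachable_from(graph, sources):
    """Return every vertex strictly reachable from any vertex in `sources`."""
    seen = set()
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def recovery_line(chains, graph):
    """Latest antichain with one checkpoint per processor.

    chains[i] lists the checkpoints of processor i, oldest first.
    Any candidate that is reachable from another candidate is rolled
    back by one checkpoint until no candidate is marked.
    """
    idx = [len(chain) - 1 for chain in chains]
    while True:
        line = [chains[i][idx[i]] for i in range(len(chains))]
        marked = reachable_from(graph, line)
        moved = False
        for i, v in enumerate(line):
            if v in marked:
                if idx[i] == 0:
                    raise ValueError("initial checkpoints are not consistent")
                idx[i] -= 1
                moved = True
        if not moved:
            return line


# Processor 1 is faulty; processor 0 contributes a virtual checkpoint "a_v".
chains = [["a0", "a1", "a_v"], ["b0", "b1"]]
graph = {"a0": ["a1"], "a1": ["a_v"], "b0": ["b1"], "b1": ["a_v"]}

line = recovery_line(chains, graph)
deleted = reachable_from(graph, line)  # strict filter of the recovery line
print(line, sorted(deleted))
```

Here the virtual checkpoint depends on computation that processor 1 must undo, so the line falls back to `a1` and `b1`, and only the virtual checkpoint lies in the deleted set.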

Our approach to solving this more general problem, in which the checkpoint graph is not
always growing, is as follows. First, we apply the results from previous sections
to the checkpoint graphs before the first rollback. Since $G' \in \mathcal{G}_f(G)$, $M^*(G')$ will not
contain any discardable checkpoints of $G$. Next, we again apply previous results to the
checkpoint graphs after the first rollback and before the second rollback to determine the set
of non-discardable checkpoints. If we can show that the rollback procedure does not make
any discardable checkpoint become non-discardable, i.e., needed for any possible second
rollback, the MCSR algorithm is then valid for Scheme A as well as for Scheme B, because
once a checkpoint is determined to be discardable it will remain discardable no matter how
many times rollback recovery occurs in the future. Note that if the set of deleted vertices can
be arbitrary, the above statement may not be true. For example, for the checkpoint graph
$G$ shown in Fig. 10(a), vertex $CP_{12}$ is found to be discardable by the MCSR algorithm.
However, in the graph $G - CP_{02}$ shown in Fig. 10(b), $CP_{12}$ becomes non-discardable.

The proof of Theorem 1 characterizes, as a byproduct, the possible sets of deleted vertices
due to rollback. That is, for any $G' \in \mathcal{G}_f(G)$, $M^*(G') \cap V \subseteq M^*(G - T)$ where

$$T = \{ n_{p(u)} : u \in M^*(G') \cap V \}. \qquad (5)$$

Define a strict filter corresponding to such $T$ as

$$\mathcal{F}_s(T) = \mathcal{F}(M^*(G - T)) \setminus M^*(G - T)$$

and denote

$$G(T) = G - \mathcal{F}_s(T). \qquad (6)$$

It suffices to prove that $N_D(G(T)) \subseteq N_D(G)$. Parallel to the definitions of $e_i$, $n_i$, $V_n$ and
$G$ for the graph $G$, we define $e_i'$, $n_i'$, $V_n'$ and $G(T)$ for the graph $G(T)$. Clearly, for each
$i \in [0, N-1]$ and $n_i \notin T$, $e_i' = e_i$ and $n_i' = n_i$. Also define

$$T' = \{ n_i' : n_i \in T \}.$$

Figure 10: Discardable checkpoint $CP_{12}$ becomes non-discardable after the removal of
checkpoint $CP_{02}$. (Shaded vertices are non-discardable according to the MCSR algorithm.)


LEMMA 8 For every $W \subseteq V_n \setminus T$,

$$G - (T \cup W) \in \mathcal{G}_f(G(T) - (T' \cup W)).$$


Proof. Clearly, by definition,

$$(G - T) - \mathcal{F}_s(T) = G(T) - T'.$$

Since $V_n \setminus T \subseteq V_n'$, $W \subseteq V_n'$ and so $(G - (T \cup W)) - \mathcal{F}_s(T) = G(T) - (T' \cup W)$. Now we have
to show that all the vertices in $\mathcal{F}_s(T)$ and their associated edges can be added to $G(T) - (T' \cup W)$,
following Rules 1 and 2, to obtain $G - (T \cup W)$. Rule 1 is obviously satisfied. By always adding
the smaller vertices first, Rule 2 is enforced among the vertices in $\mathcal{F}_s(T)$ during the process.
Suppose Rule 2 is violated when $v \in \mathcal{F}_s(T)$ is added, i.e., there exists $u \in G(T) - (T' \cup W)$
such that an edge is drawn from $v$ to $u$. Clearly, $u \notin V_n' \setminus (T' \cup W)$. If $u \in G(T)$, $v < u$
and $v \in \mathcal{F}_s(T)$ imply $u \in \mathcal{F}_s(T)$ by the definition of $\mathcal{F}_s(T)$. This contradicts the fact that
$G(T) \cap \mathcal{F}_s(T) = \emptyset$. Therefore, Rule 2 is also satisfied. $\Box$

THEOREM 4 Given a checkpoint graph $G = (V, E)$, for every $G' \in \mathcal{G}_f(G)$,

$$N_D(G(T)) \subseteq N_D(G)$$

where $T$ and $G(T)$ are as defined in Eqs. 5 and 6.

Proof. If $v \in N_D(G(T))$, $v \in M^*(G'(T))$ for some $G'(T) \in \mathcal{G}_f(G(T))$. Denote $G(T) =
(V(T), E(T))$ and $G'(T) = (V'(T), E'(T))$. Let $M^*(G'(T)) = M_1 \cup M_2$ where $M_1 =
M^*(G'(T)) \cap V(T)$ and $M_2 = M^*(G'(T)) \setminus M_1$. Clearly, $v \in M_1$. Also define

$$M_3 = \{ e_{p(u)} : u \in M_2 \text{ and } n_{p(u)} \in T \}.$$

Decompose $V_n$ as $V_n = B_1 \cup B_2 \cup T$ where $B_1 = \{ n_{p(u)} : u \in M_1 \} \setminus T$ and $B_2 = \{ n_{p(u)} : u \in
M_2 \} \setminus T$. Fig. 11 illustrates the above notation. We want to prove that

$$M_1 \cup M_3 \cup B_2 = M^*(G - (T \cup B_1)).$$

First we show $M_1 \cup M_3 \cup B_2 \in \mathcal{M}(G - (T \cup B_1))$. Recall that $M_3 \cup B_2 \subseteq M^*(G - T)$
and so $M_3 \cup B_2 \in \mathcal{A}(G - T)$. Since $G - T \in \mathcal{G}_f(G(T) - T')$ by the above lemma and
$(M_3 \cup B_2) \cap \mathcal{F}_s(T) = \emptyset$, we have $M_3 \cup B_2 \in \mathcal{A}(G(T) - T')$ by Lemma 4. By the same argument,
$M_3 \cup B_2 \in \mathcal{A}(G(T) - (T' \cup B_1))$. Following the proof of Theorem 1, we have $M_1 \cap \mathcal{F}(w) = \emptyset$
for all $w \in M_3 \cup B_2$. Since all the vertices in $M_3 \cup B_2$ are maximal in $G(T) - (T' \cup B_1)$
(Fig. 11(b)), $M_1 \cap \mathcal{F}(M_3 \cup B_2) = \emptyset$ and so $M_1 \cup M_3 \cup B_2 \in \mathcal{M}(G(T) - (T' \cup B_1))$. It follows
that $M_1 \cup M_3 \cup B_2 \in \mathcal{M}(G - (T \cup B_1))$, again by applying Lemmas 8 and 4.

Now suppose $M_1 \cup M_3 \cup B_2 \neq M^*(G - (T \cup B_1))$. Since $G - T \in \mathcal{G}_f(G - (T \cup B_1))$, Lemma 4
gives $M^*(G - (T \cup B_1)) \preceq M^*(G - T)$. By Lemma 2, the facts that $M_3 \cup B_2 \subseteq M^*(G - T)$
and $M_1 \cup M_3 \cup B_2 \in \mathcal{M}(G - (T \cup B_1))$ imply $M_3 \cup B_2 \subseteq M^*(G - (T \cup B_1))$. Therefore,
there must exist (Fig. 11(c))

$$M_1' = M^*(G - (T \cup B_1)) \setminus (M_3 \cup B_2)$$

such that $M_1 \preceq M_1'$ and $M_1 \neq M_1'$. We now have $\mathcal{F}(M_1') \subseteq \mathcal{F}(M_1)$. Together with $M_2 \cap
\mathcal{F}(M_1) = \emptyset$, we get $M_2 \cap \mathcal{F}(M_1') = \emptyset$ and so $M_1' \cup M_2 \in \mathcal{M}(G'(T))$. This is a contradiction
to $M^*(G'(T)) = M_1 \cup M_2$ because $M_1 \cup M_2 \preceq M_1' \cup M_2$ and $M_1 \cup M_2 \neq M_1' \cup M_2$. Therefore,
we must have $M_1 \cup M_3 \cup B_2 = M^*(G - (T \cup B_1))$.

Finally,

$$v \in M_1 \subseteq M^*(G - (T \cup B_1)) \subseteq N_D(G)$$

and so $N_D(G(T)) \subseteq N_D(G)$. $\Box$


IX CONCLUSIONS 

The problem of finding recovery lines for message-passing systems using independent
checkpointing is formulated as determining the maximum maximum-sized antichains of
partially ordered sets. We present a method for predicting the possibility for a checkpoint to
become a member of some recovery line in the future, and show that some of the checkpoints
will never be needed for recovery so their space can be reclaimed. Based on the
algorithm for finding the recovery lines, a maximum checkpoint space reclamation algorithm,
with complexity linear in the number of processors $N$ and linear in the number of edges in
the checkpoint graph, is developed for determining the set of non-discardable checkpoints.
The maximum, $N(N+1)/2$, of the number of non-discardable checkpoints for an arbitrary
checkpoint graph is also derived to show that the space overhead for maintaining multiple
checkpoints is bounded even when the domino effect persists during program execution.

Figure 11: Suppose (a) $M_1 \cup M_2$ forms the MM-chain of $G'(T)$. Then $M_1 \cup M_3 \cup B_2$ is an
M-chain of $G(T) - (T' \cup B_1)$ and forms the MM-chain of $G - (T \cup B_1)$.